ABSTRACT


Over the past decade, low graduation and retention rates have 
plagued higher education institutions. To help students graduate 
on time and achieve optimal learning outcomes, many institutions 
provide advising services supported by educational technologies. 
Accurate grade prediction is an integral part of these services, such
as degree planning software, personalized advising systems, and
early warning systems that can identify students at risk of dropping
from their field of study. In this work, we present next-term
grade prediction models based on students’ cumulative knowledge 
and co-taken courses. The proposed models are based on a ma- 
trix factorization framework and incorporate a co-taken course in- 
teraction function to learn the influence from the co-taken courses 
on the target course. The co-taken course interaction function is 
formed by a neural network, which takes the knowledge difference 
between the co-taken courses and the target course as input, and 
outputs an influence value that will be used to predict students’ 
grades on the target course. The experimental results on vari- 
ous datasets from a U.S. University demonstrate that the proposed 
models significantly outperform competitive baselines across dif- 
ferent test sets. Furthermore, we analyze the proposed models’ 
performance with different numbers of co-taken courses as well 
as different numbers of co-taken course subjects, and highlight 
with an application case study how a student might make decisions
related to the selection of courses. The code is available at
https://github.com/Zhiyun0411/EDM. 
1. INTRODUCTION


For over a decade, higher education institutions in the United States
have been grappling with low graduation rates [9]. The National
Center for Education Statistics¹ reports that approximately 59%
of students who started college in 2009 graduated with a 4-year
college degree within 6 years. There is a pressing need for
data-driven applications and services that guide students through
academic pathways and help them achieve better learning outcomes.
Many higher education institutions have implemented programs and
services supported by educational technologies to increase overall
graduation rates [17]. For example, the Academic Advising service²
provides student-centered advising at Purdue University. Graduation
Progression Success (GPS) Advising³, implemented at Georgia State
University, helps identify at-risk students and has advisors respond
to alerts. Its reports show a 6% increase in the 6-year graduation
rate over 4 years. Our work aims to help students select courses for
the next term by developing methods that can provide accurate grade
predictions for courses they have not yet taken.


In the past few years, many approaches have been developed for
next-term grade prediction. One of the most popular approaches
is matrix factorization (MF), which is inspired by the Recommender
Systems (RS) literature [2, 3, 7, 15, 18]. Specifically, MF
decomposes the student-course grade matrix into two matrices
containing student and course latent factors, respectively. The
predicted grade of a student on a course is given by the inner
product of the corresponding student and course latent factors [4, 10].
Other extended MF-based models achieve better grade prediction
results than MF. For example, Morsy et al. [8] proposed a Cumulative
Knowledge-based Regression Model (CK) to tackle the next-term grade
prediction problem. CK models each student with the cumulative
knowledge acquired by the student in past terms. However, among all
the existing methods for next-term grade prediction [2, 13, 14], very
few consider the effect of co-taken courses on students’ performance.


¹ https://nces.ed.gov
² http://www.purdue.edu/advisors/index.html
³ http://giving.gsu.edu/student-success/


Figure 1: Students’ Performance with Different Co-taken Course
Pairs. Note: BIOL311 is the course “General Genetics”. CHEM313
is the course “Organic Chemistry”. CS321 is the course “Software
Engineering”. ECE301 is the course “Digital Electronics”. MATH114 is
the course “Analytic Geometry and Calculus”. CS211 is the course
“Object Oriented Programming”. MATH203 is the course “Linear
Algebra”. CS262 is the course “Low-level Programming”.


We conduct a statistical analysis on a dataset collected from George
Mason University to demonstrate the effects of co-taken courses on
students’ performance. Figure 1 shows the true grade distribution of
students on a specific course with and without enrolling in another
course in the same term. The course pairs chosen for this analysis
co-occur frequently in our dataset. For each target course pair, we
choose the students who take more than four courses in a term,
including the corresponding course pair. We keep a student only if
the other co-taken courses share few topics/materials with the target
course pair. Figure 1 shows that students who take BIOL311 (Genetics)
with CHEM313 (Organic Chemistry) have fewer “F”, “D” and “C” grades,
and several more “B” grades, than students who take BIOL311 without
CHEM313 in a term. A similar trend is found for the course pair CS321
(Software Engineering) and ECE301 (Digital Electronics). In contrast,
students who take MATH114 (Calculus) with CS211 (Object Oriented
Programming) have more “F” grades than students who take MATH114
without CS211 in a term. Students who co-take MATH203 (Linear
Algebra) and CS262 (Low-level Programming) have more “C” grades than
students who take MATH203 without CS262 in a term. This shows that it
can be challenging for students to take some courses together in a
term (e.g., MATH114 and CS211, MATH203 and CS262), while taking other
course pairs together might not cause a grade drop (e.g., BIOL311 and
CHEM313, CS321 and ECE301). Thus, we assume that co-taken courses can
have a substantial effect on student grades in different ways.


In this work, we propose grade prediction models that incorporate
both Cumulative Knowledge and Co-taken Courses (CKCC) to
predict students’ performance in the next term. Inspired by Morsy
et al. [8], the proposed methods model each student’s latent factors
by accumulating the knowledge provided by the sequence of courses
the student has taken in past terms. Furthermore, we introduce
a co-taken course interaction function to model the influence of the
co-taken courses on students’ performance. The co-taken course
interaction function is formed by a neural network, which takes the
knowledge difference between the co-taken courses and the target
course as input, and outputs an influence value from the co-taken
courses on the target course. We conduct comprehensive experiments
on various datasets collected from George Mason University and a
thorough analysis of the effect of co-taken courses. Our experimental
results show that CKCC significantly outperforms other competitive
baseline methods on the task of grade prediction. We also provide a
detailed case study on how our model can help students in course
selection for the next term.


The main contributions can be summarized as follows: 


1. We develop CKCC models for next-term grade prediction.
The models consider both students’ cumulative knowledge 
and co-taken courses in the target term. To the best of our 
knowledge, this is the first work that learns and explicitly 
incorporates influences from co-taken courses for grade pre- 
diction. 


2. We provide a detailed case study on how our model helps 
students in course selection for the next term by compar- 
ing the performance of CKCC with different sets of co-taken 
courses. 


2. RELATED WORK 
2.1 Grade Prediction Approaches 


Methods originating from recommender systems research have attracted
increasing attention in educational data mining [2, 3, 13, 14, 20].
Sweeney et al. [18, 19] applied several recommender systems
approaches to predict next-term grades. The authors implemented
MF-based methods, including SVD, SVD-kNN and factorization
machines, and simple baseline methods, including global, student,
and course means. The work showed that MF-based methods
consistently achieve better grade prediction results than the baselines.
Elbadrawy et al. [1] developed a domain-aware grade prediction
method with student/course-group biases. This method groups
students based on majors and academic levels. Additionally, it groups
courses based on course levels and course subjects. The method
assumes that the students/courses in the same group tend to have
similar biases. Accordingly, this method models biases for each
student and course group within an MF framework and achieves a
significant improvement in grade prediction performance over baselines.


2.2 Grade Prediction based on Student Historical Information
Polyzou et al. [12] addressed the future course grade prediction
problem with different approaches based on sparse linear models
and MF approaches. The experimental results showed that
the course-specific regression approach achieved the best
performance among all approaches. This method predicts a student’s
performance using a sparse linear combination of the grades that the
student obtained in past courses. Morsy et al. [8] proposed a model
named Cumulative Knowledge-based Regression Model (CK) to
predict a student’s grade on a certain course in the next term. CK
models each student with the cumulative knowledge he/she obtained
from the sequence of courses he/she took in the past. Then
CK calculates the inner product of the cumulative knowledge vector
of a student and the required knowledge vector of the target
course as the predicted grade. The experimental results showed
that CK significantly outperforms MF in grade prediction. Ren
et al. [13] proposed a matrix factorization model with temporal
course-wise influence to predict next-term student grades. This
model considers two components in predicting a student’s grade
on a certain course: (i) the student’s competence with respect to
the target course’s topics, content and requirements, etc., and (ii)
the student’s previous performance over other courses. The study
concluded that considering temporal influence can significantly
improve the next-term grade prediction performance.


2.3 Neural Network in Educational Data Mining

Neural networks have been applied to solve many educational data
mining problems. For example, Sharma et al. [16] proposed a composite
deep neural network to predict whether an educational video
is lively or not. The proposed method first used a convolutional
neural network to extract the video features, and then used a deep
recurrent neural network to predict the human movement label in
order to detect video liveliness. Klingler et al. [6] presented a
semi-supervised classification pipeline that employed deep variational
auto-encoders to detect students who are suffering from
developmental dyscalculia. Piech et al. [11] introduced Deep Knowledge
Tracing (DKT) to model student learning with Recurrent Neural
Networks. The authors provided experiments on how to use DKT
to detect latent structure between the assessments in the dataset.
The models proposed in this paper tackle the challenges of next-term
grade prediction using students’ history information (the sequence
of courses the student has taken) and the co-taken courses
in the next term. The main contribution of our model is to explicitly
incorporate the co-taken courses within an MF framework.


3. PRELIMINARIES AND PROBLEM DEFINITION


3.1 Problem Definition 


Formally, student-course grades are represented by $G_1, G_2, \ldots,
G_T$ for a total of $T$ terms. Each $G_t$ is a matrix that contains the
set of student-course grades for all students enrolled in courses within
term $t$. For all the students, the set of student-course grades up to
term $t$ can be represented by $G^t = \bigcup_{i=1}^{t} G_i$. The set of
courses that student $s$ has taken in term $t$ is represented by
$C_{s,t}$, and the set of grades that student $s$ achieves in term $t$
is represented by $G_{s,t}$. The set of courses that student $s$ has
taken up to term $t$ is represented by $C_s^t$, and the set of grades
that student $s$ has achieved up to term $t$ is represented by $G_s^t$.


In this paper, all vectors are represented by bold lower-case letters
and all matrices are represented by upper-case letters. Row vectors
are represented by having the transpose superscript $^\top$; otherwise,
by default, they are column vectors. A predicted value is denoted by
having a tilde symbol (e.g., $\tilde{g}$). Table 1 summarizes the key
notations used in this paper.


Given student-course grades up to term $t-1$ and the set of courses
each student plans to take in term $t$, the objective of our work is
to predict a student’s grade on a specific course given the set of
co-taken courses in term $t$.


3.2 Grade Prediction based on Matrix Factorization

MF methods factor the student-course grade matrix into two matrices
containing latent factors of courses and students in a common
knowledge space, respectively [1, 12]. The dimension of the knowledge
space is much lower than that of the original student-course
grade matrix. We use $\mathbf{p}_s$ ($\mathbf{p}_s \in \mathbb{R}^k$)
and $\mathbf{q}_c$ ($\mathbf{q}_c \in \mathbb{R}^k$) to represent
latent factors of $k$ dimensions for student $s$ and course $c$,
respectively. Thus, the grade of student $s$ on course $c$ can be
predicted as


$$\tilde{g}_{s,c} = \mathbf{p}_s^\top \mathbf{q}_c + b_s + b_c, \tag{1}$$


where $b_s$ and $b_c$ are bias terms for student $s$ and course $c$,
respectively.
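As a minimal illustration of Eq. 1, the sketch below computes an MF prediction from plain Python lists. The function name `predict_mf` and the numeric values are illustrative assumptions, not part of the original work.

```python
from typing import Sequence


def predict_mf(p_s: Sequence[float], q_c: Sequence[float],
               b_s: float, b_c: float) -> float:
    """Eq. 1: inner product of latent factors plus student and course biases."""
    if len(p_s) != len(q_c):
        raise ValueError("latent vectors must share the same dimension k")
    return sum(p * q for p, q in zip(p_s, q_c)) + b_s + b_c


# Illustrative values only (k = 3); not taken from the paper's data.
grade = predict_mf([0.5, 1.0, 0.2], [1.0, 2.0, 0.5], b_s=0.3, b_c=-0.1)
```

In practice the latent factors and biases would be learned from the observed student-course grades; here they are fixed by hand to keep the example self-contained.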


3.3 Grade Prediction with Cumulative Knowledge


Morsy et al. [8] proposed the CK model, which learns each student’s
latent factors from the cumulative knowledge acquired by the
student in past terms. Specifically, CK uses two vectors to model a
course: the knowledge provided by the course and the prerequisite
knowledge of the course, respectively. A student’s latent factor is
given by the knowledge accumulated from the previous courses that
the student has taken and the corresponding course grades. Formally,
the cumulative knowledge acquired by student $s$ up to term
$t$ is represented by $\mathbf{p}^t_{CK(s)}$ and is given by:


$$\mathbf{p}^t_{CK(s)} = \sum_{g_{s,c'} \in G_s^t} \left( e^{-\lambda (t_{s,c} - t_{s,c'})} \, \mathbf{k}_{c'} \cdot g_{s,c'} \right), \tag{2}$$


where $t_{s,c'}$ is the term in which student $s$ took course $c'$,
$e^{-\lambda (t_{s,c} - t_{s,c'})}$ is an exponential time decay function
with $\lambda > 0$ denoting the decay rate, $\mathbf{k}_{c'}$ is the
latent knowledge factor of course $c'$, and $g_{s,c'}$ is the grade of
student $s$ on course $c'$. Given $\mathbf{p}^{t-1}_{CK(s)}$, CK predicts
student $s$’s grade on course $c$ in term $t$ as follows:


$$\tilde{g}^t_{s,c} = {\mathbf{p}^{t-1}_{CK(s)}}^{\top} \mathbf{q}_c. \tag{3}$$


Note that in prior work, Ren et al. [14] have shown that CK can
achieve better grade prediction performance when the cumulative
knowledge $\mathbf{p}^{t-1}_{CK(s)}$ is averaged in Eq. 3. Therefore,
$\tilde{g}^t_{s,c}$ is presented as follows:


$$\tilde{g}^t_{s,c} = \frac{1}{|G_s^{t-1}|} \sum_{g_{s,c'} \in G_s^{t-1}} \left( e^{-\lambda (t_{s,c} - t_{s,c'})} \, \mathbf{k}_{c'} \cdot g_{s,c'} \right)^{\top} \mathbf{q}_c. \tag{4}$$


We refer to this model as the averaged cumulative knowledge (CK)
model and will consider it as one of our baseline methods.
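The cumulative knowledge computation and the averaged CK prediction can be sketched as follows. This is a hedged illustration: terms are assumed to be encoded as consecutive integers, and the names `cumulative_knowledge`, `predict_ck`, and the tuple layout `PastCourse` are hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple

# One past enrollment: (term index, provided-knowledge vector k_c', grade g_{s,c'}).
# Integer term indices are an assumption made for this sketch.
PastCourse = Tuple[int, Sequence[float], float]


def cumulative_knowledge(history: List[PastCourse], target_term: int,
                         decay: float, average: bool = True) -> List[float]:
    """Sum time-decayed, grade-weighted provided-knowledge vectors."""
    if not history:
        raise ValueError("student has no past courses")
    k = len(history[0][1])
    p = [0.0] * k
    for term, k_vec, grade in history:
        # Older courses get exponentially smaller weights.
        weight = math.exp(-decay * (target_term - term)) * grade
        for i in range(k):
            p[i] += weight * k_vec[i]
    if average:
        # Averaged variant: divide by the number of past grades.
        p = [x / len(history) for x in p]
    return p


def predict_ck(history: List[PastCourse], target_term: int,
               q_c: Sequence[float], decay: float, average: bool = True) -> float:
    """Inner product of cumulative knowledge and required knowledge q_c."""
    p = cumulative_knowledge(history, target_term, decay, average)
    return sum(a * b for a, b in zip(p, q_c))


# Illustrative values only: two past courses, k = 2, no decay.
history = [(1, [1.0, 0.0], 4.0), (2, [0.0, 1.0], 3.0)]
pred = predict_ck(history, target_term=3, q_c=[0.5, 0.5], decay=0.0)
```

Setting `average=False` gives the unaveraged form, which makes it easy to compare both variants on the same history.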


4. METHODS 
4.1 Model Overview 


In this paper, we propose grade prediction models that incorporate
Cumulative Knowledge and Co-taken Courses (CKCC). To predict
student $s$’s grade on course $c$ in term $t$, CKCC takes into account
two factors: (i) the cumulative knowledge of student $s$ up to term
$t-1$, and (ii) the other courses that will be taken together with
course $c$ in term $t$. To model the first factor, we adopt the CK
model as in Eq. 4; that is, we accumulate the provided knowledge of
the courses that student $s$ has taken in the past, denoted as $c'$, to
represent his/her cumulative knowledge, and use a latent factor to
represent the knowledge required by course $c$. To model the second
factor, we introduce a co-taken course interaction function $f(\cdot)$
to learn the influence from co-taken courses, denoted as $c''$, on
student $s$’s grade on course $c$ in term $t$.


Table 1: Notations

Notation                 Explanation
$m$                      number of courses
$n$                      number of students
$k$                      number of latent dimensions
$\mathbf{p}^t_{CK(s)}$   the cumulative knowledge of student s up to term t
$\mathbf{q}_c$           latent factor of the required knowledge components of course c
$\mathbf{k}_c$           latent factors of the provided knowledge components of course c
$b_s$                    student bias term
$b_c$                    course bias term
$g^t_{s,c}$              the grade of student s on course c at term t
$t_{s,c}$                the academic term when student s takes course c
$G_t$                    student-course grades at term t
$G^t$                    all the student-course grades up to term t
$G_{s,t}$                all the grades student s obtains at term t
$G_s^t$                  all the grades student s obtains up to term t
$C_{s,t}$                the set of courses student s chooses at term t
$C_s^t$                  the set of courses student s chooses up to term t


Specifically, we use a latent vector $\mathbf{q}_c$ to represent the
knowledge components that course $c$ requires. We hypothesize that the
difference in the required knowledge between two courses causes the
influence of one course on the other, as shown in Figure 1. Based
on this hypothesis, the difference between $\mathbf{q}_c$ of course $c$
and $\mathbf{q}_{c''}$ of a co-taken course $c''$ can be used in
$f(\cdot)$ to learn the influence from $c''$ to $c$. We sum up the
differences between each co-taken course $c''$ and $c$ in order to
aggregate the influence. Thus, the sum of the absolute values of the
differences between each $\mathbf{q}_{c''}$ and $\mathbf{q}_c$, that is,
$\sum_{c'' \in C_{s,t} \setminus \{c\}} |\mathbf{q}_{c''} - \mathbf{q}_c|$,
is used in $f(\cdot)$ to learn the influence from all co-taken courses.
Note that absolute values are used here to avoid scenarios in which the
influences from different co-taken courses cancel out. Thus, CKCC
predicts student $s$’s grade on course $c$ in term $t$ as follows:


$$\tilde{g}^t_{s,c} = \frac{1}{|G_s^{t-1}|} \sum_{g_{s,c'} \in G_s^{t-1}} \left( e^{-\lambda (t_{s,c} - t_{s,c'})} \, \mathbf{k}_{c'} \cdot g_{s,c'} \right)^{\top} \mathbf{q}_c + f\Big( \sum_{c'' \in C_{s,t} \setminus \{c\}} |\mathbf{q}_{c''} - \mathbf{q}_c| \Big), \tag{5}$$


where $|\mathbf{q}_{c''} - \mathbf{q}_c|$ is the vector of absolute
values of the entry-wise difference between latent vector
$\mathbf{q}_{c''}$ and latent vector $\mathbf{q}_c$, and
$c'' \in C_{s,t} \setminus \{c\}$ indicates that course $c''$ is one of
the courses taken together with $c$ in term $t$. Note that in Eq. 5, the
two terms share a common latent vector $\mathbf{q}_c$.


4.2 Co-taken Course Interaction Function 

In CKCC, the co-taken course interaction function $f(\cdot)$ learns the
influence on student $s$’s grade on course $c$ from all the other
co-taken courses in term $t$. We hypothesize that such influence can be
nonlinear in general. Therefore, we use a feedforward neural network
(FNN) [21] as $f(\cdot)$ to model the influence. The FNN takes
the input described in Section 4.1 and outputs a scalar influence
value on course $c$. We use the hyperbolic tangent (Tanh) as the
activation function in each layer of the FNN. Note that when there
are no hidden layers and no nonlinearity, the FNN model learns the
weights directly from the input layer (i.e., difference of courses) to

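Algorithm 1 CKCC: Learn

procedure CKCC_LEARN
    Initialize $\mathbf{k}_c$, $\mathbf{q}_c$ for each $c$
    $\eta \leftarrow$ learning rate
    $T \leftarrow$ number of terms in training set
    $\lambda \leftarrow$ time decay parameter
    $\alpha_1, \alpha_2, \alpha_3 \leftarrow$ regularization weights
    $t \leftarrow 2$
    $iter \leftarrow 0$
    while $iter < maxIter$ do
        for $t \le T$ do
            for all $g^t_{s,c} \in G_t$ do                                ▷ step 1
                $\hat{g}^t_{s,c} \leftarrow g^t_{s,c} - f(\sum_{c'' \in C_{s,t} \setminus \{c\}} |\mathbf{q}_{c''} - \mathbf{q}_c|)$
                $\mathbf{p}_{CK(s)} \leftarrow \mathbf{0}$
                for all $c' \in C_s^{t-1}$ do
                    $\mathbf{p}_{CK(s)} \leftarrow \mathbf{p}_{CK(s)} + e^{-\lambda (t_{s,c} - t_{s,c'})} \, \mathbf{k}_{c'} \cdot g_{s,c'}$
                $\tilde{g}^t_{s,c} \leftarrow \mathbf{p}_{CK(s)}^{\top} \mathbf{q}_c$
                $e^t_{s,c} \leftarrow \hat{g}^t_{s,c} - \tilde{g}^t_{s,c}$
                for all $c' \in C_s^{t-1}$ do
                    $\mathbf{k}_{c'} \leftarrow \mathbf{k}_{c'} + \eta \, (e^{-\lambda (t_{s,c} - t_{s,c'})} \, \mathbf{q}_c \cdot g_{s,c'} \cdot e^t_{s,c} - \alpha_1 \cdot \mathbf{k}_{c'})$
                $\mathbf{q}_c \leftarrow \mathbf{q}_c + \eta \, (\mathbf{p}_{CK(s)} \cdot e^t_{s,c} - \alpha_2 \cdot \mathbf{q}_c)$
            for all $g^t_{s,c} \in G_t$ do                                ▷ step 2
                $\hat{g}^t_{s,c} \leftarrow g^t_{s,c} - \mathbf{p}_{CK(s)}^{\top} \mathbf{q}_c$
                $\tilde{g}^t_{s,c} \leftarrow f(\sum_{c'' \in C_{s,t} \setminus \{c\}} |\mathbf{q}_{c''} - \mathbf{q}_c|)$
                $e^t_{s,c} \leftarrow \hat{g}^t_{s,c} - \tilde{g}^t_{s,c}$
                Update $\Theta_f$ with Adam
        $iter \leftarrow iter + 1$
    return $\Theta = \{\{\mathbf{k}_c\}, \{\mathbf{q}_c\}\}$, $\Theta_f$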


the output layer (i.e., the influence), and the function $f(\cdot)$
becomes a simple inner product operation (parameterized by a vector).
This simplified model is referred to as CKCC-l. Figure 2 shows the
structure of the CKCC model.
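The co-taken course input and the two forms of $f(\cdot)$ described above can be sketched as follows. This is only an illustrative sketch: it assumes one tanh hidden layer followed by a linear scalar output, which is one possible reading of the FNN description, and all function and parameter names (`cotaken_input`, `fnn_influence`, `W1`, `w2`, and so on) are hypothetical.

```python
import math
from typing import List, Sequence


def cotaken_input(q_c: Sequence[float],
                  q_cotaken: List[Sequence[float]]) -> List[float]:
    """Entry-wise sum over co-taken courses of |q_c'' - q_c|."""
    return [sum(abs(q2[i] - q_c[i]) for q2 in q_cotaken)
            for i in range(len(q_c))]


def fnn_influence(x: Sequence[float], W1: List[Sequence[float]],
                  b1: Sequence[float], w2: Sequence[float], b2: float) -> float:
    """One tanh hidden layer, then a linear scalar output (assumed layout)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def linear_influence(x: Sequence[float], w: Sequence[float]) -> float:
    """Linear variant: f reduces to an inner product with a weight vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict_ckcc(ck_term: float, q_c: Sequence[float],
                 q_cotaken: List[Sequence[float]], W1, b1, w2, b2) -> float:
    """Averaged CK term plus the co-taken course influence."""
    x = cotaken_input(q_c, q_cotaken)
    return ck_term + fnn_influence(x, W1, b1, w2, b2)


# Illustrative values only: target course and two co-taken courses, k = 2.
q_c = [0.5, 0.5]
x = cotaken_input(q_c, [[1.0, 0.0], [0.5, 1.5]])
```

Here `ck_term` stands for the value of the averaged cumulative knowledge term, computed separately; keeping it as a plain number makes the additive structure of the prediction explicit.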


4.3 Optimization of CKCC 


Given the grade estimation as in Eq. 5, we formulate the grade
prediction problem for term $T$ as the following optimization problem:


$$\min_{\Theta, \Theta_f} \sum_{t=1}^{T-1} \sum_{g^t_{s,c} \in G_t} \left( g^t_{s,c} - \tilde{g}^t_{s,c} \right)^2 + \alpha_1 \left( \|\mathbf{k}_c\|_1 + \|\mathbf{q}_c\|_1 \right) + \alpha_2 \left( \|\mathbf{k}_c\|_2^2 + \|\mathbf{q}_c\|_2^2 \right) + \alpha_3 \|\mathrm{vec}(\Theta_f)\|_2^2, \tag{6}$$


where $\Theta = \{\{\mathbf{k}_c\}, \{\mathbf{q}_c\}\}$ represents the set
of latent vectors, and $\Theta_f$ represents the parameters of
$f(\cdot)$. $\alpha_1$, $\alpha_2$, and $\alpha_3$ denote the
nonnegative weights on the regularization terms to prevent overfitting.


The optimization process for CKCC is presented in Algorithm 1. It
consists of two steps. The first step updates the course parameters,
i.e., $\Theta$, using stochastic gradient descent. The second step
updates the parameters of $f(\cdot)$, i.e., $\Theta_f$, with the
adaptive moment estimation (Adam) algorithm [5].


5. EXPERIMENTS 
5.1 Dataset Description 


The data used in this work is obtained from George Mason University.
Our dataset contains two student groups: first-time freshmen
(FTF; i.e., students who begin their study initially at this
University), and transfer students (TR; i.e., students who transfer to
this University from a different one). The dataset was extracted
for the period from Fall 2009 to Spring 2018. It includes information
on 23,435 FTF students and 28,470 TR students across 153 majors.
For simplicity, we use students from six different majors to evaluate
the proposed models. These majors have different numbers of enrolled
students and courses, and different major syllabi. We evaluate
these majors on both FTF and TR student groups. The majors
in our experiment include: (i) Mathematical Sciences (MATH), (ii)
Physics (PHYS), (iii) Chemistry (CHEM), (iv) Information Technology
(IT), (v) Computer Science (CS) and (vi) Biology (BIOL).
Table 2 shows the statistics across these majors.


Figure 2: CKCC Model Structure


Table 2: Dataset Statistics

         FTF student group          TR student group
Major    #S      #C      #S-C       #S      #C     #S-C
MATH     271     693     3,325      243     597    2,031
PHYS     144     488     2,044      73      286    905
CHEM     427     673     4,942      257     473    1,937
IT       430     473     5,984      1,163   487    10,302
CS       819     714     16,955     526     435    7,840
BIOL     1,951   1,197   22,065     1,481   980    10,851

#S, #C and #S-C are the number of students, courses
and student-course pairs from Fall 2009 to Spring 2018,
respectively.


5.2 Experimental Protocols 

To assess the performance of our next-term grade prediction models,
we train our models on data up to term $T-1$ and make predictions
for term $T$. We evaluate our method on three test terms,
i.e., Spring 2018, Fall 2017 and Spring 2017. As an example, for
evaluating predictions for Fall 2017, data from Fall 2009 to
Spring 2017 is considered as training data and data from Fall 2017
is testing data. Figure 3 shows the three different train-test
splits.


Figure 3: Different Experimental Protocols (panels: Spring 2018, Fall 2017, Spring 2017; legend: training set, test set)
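The train-test protocol can be illustrated with a small sketch that trains on every term before the test term and tests on the test term itself. The record layout, the helper name `split_by_term`, and the sample rows are hypothetical and serve only to show the split logic.

```python
from typing import List, Tuple

# Hypothetical record layout: (student, course, term, grade points).
Record = Tuple[str, str, str, float]


def split_by_term(records: List[Record], ordered_terms: List[str],
                  test_term: str) -> Tuple[List[Record], List[Record]]:
    """Train on all terms strictly before `test_term`; test on `test_term`."""
    cutoff = ordered_terms.index(test_term)
    earlier = set(ordered_terms[:cutoff])
    train = [r for r in records if r[2] in earlier]
    test = [r for r in records if r[2] == test_term]
    return train, test


# Terms must be listed in chronological order; rows are made-up examples.
ordered_terms = ["Spring 2017", "Fall 2017", "Spring 2018"]
records = [
    ("s1", "MATH114", "Spring 2017", 3.0),
    ("s1", "CS211", "Fall 2017", 3.67),
    ("s2", "MATH203", "Fall 2017", 2.0),
    ("s1", "CS262", "Spring 2018", 3.33),
]
train, test = split_by_term(records, ordered_terms, "Fall 2017")
```

Records from terms after the test term are simply dropped, so the same record list can be reused for each of the three test terms.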


5.3 Evaluation Metrics
In our experiments, we use Mean Absolute Error (MAE) to evaluate
the predicted results numerically. MAE is calculated as:


$$\mathrm{MAE} = \frac{\sum_{g^T_{s,c} \in G_T} \left| g^T_{s,c} - \tilde{g}^T_{s,c} \right|}{|G_T|}, \tag{7}$$


where $g^T_{s,c}$ and $\tilde{g}^T_{s,c}$ are the ground-truth grade and
predicted grade for student $s$ on course $c$ at term $T$, respectively.
$G_T$ is the set of student-course grades in the $T$-th term, which is
considered as the test set in our experiment.
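A minimal sketch of the MAE computation in Eq. 7, assuming the ground-truth and predicted grades are given as two aligned lists of grade points. The function name and the sample values are illustrative.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted grades."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only.
mae = mean_absolute_error([4.0, 3.0, 2.0], [3.5, 3.0, 3.0])
```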


Moreover, since a student receives a letter grade for a course, i.e.,
A, A-, ..., F, we use the Percentage of Tick Accuracy (PTA) [12]
as one of our evaluation metrics. During training, we map letter
grades “A+” and “A” to the real-valued grade point number 4.0, “A-”
to 3.67, “B+” to 3.33, etc. During testing, we map the predicted
grade point numbers back to their closest letter grades. Then, we
define a tick as the difference between two consecutive letter grades
(e.g., C+ vs. C or C vs. C-). We then compute the percentage of
predicted grades that match the actual grades (i.e., are within 0 ticks
of them), and those that are within 1 tick and within 2 ticks of the
actual grades, as PTA_0, PTA_1, and PTA_2, respectively.
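The tick-based metric can be sketched as below. The grade ladder is an assumption: the text only gives the values for A+/A, A-, and B+, so the remaining grade points (and the absence of D+ and D-) are hypothetical choices for this example, as are the function names.

```python
from typing import List, Sequence, Tuple

# Assumed grade ladder; only A (4.0), A- (3.67), and B+ (3.33) are stated
# in the text. The remaining entries are illustrative.
GRADE_POINTS: List[Tuple[str, float]] = [
    ("A", 4.0), ("A-", 3.67), ("B+", 3.33), ("B", 3.0), ("B-", 2.67),
    ("C+", 2.33), ("C", 2.0), ("C-", 1.67), ("D", 1.0), ("F", 0.0),
]


def to_tick_index(points: float) -> int:
    """Index on the ladder of the letter grade closest to `points`."""
    return min(range(len(GRADE_POINTS)),
               key=lambda i: abs(GRADE_POINTS[i][1] - points))


def pta(actual: Sequence[float], predicted: Sequence[float], ticks: int) -> float:
    """Fraction of predictions within `ticks` letter grades of the actual grade."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted)
               if abs(to_tick_index(a) - to_tick_index(p)) <= ticks)
    return hits / len(actual)


# Illustrative values only.
actual = [4.0, 3.0, 2.0, 2.0, 0.0]
predicted = [3.8, 3.4, 2.9, 2.1, 1.2]
pta0, pta1, pta2 = (pta(actual, predicted, t) for t in (0, 1, 2))
```

Because both the actual and the predicted grade points are snapped to the nearest ladder entry, the metric counts distance in letter-grade steps rather than in grade points.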
5.4 Compared Methods


Since there is no prior research on the influence of co-taken courses
within the same term, we use the following two methods and three
other variants of CKCC as baselines in our experiments:


• MF: The MF model is described in Eq. 1.

• CK: The CK model is described in Eq. 4.

• MFCC: We add the co-taken course influence to the MF
  model, and obtain the Matrix Factorization with Co-taken
  Courses (MFCC) model. Specifically, the predicted grade of
  student $s$ on course $c$ at term $t$ is defined as


  $$\tilde{g}^t_{s,c} = \mathbf{p}_s^\top \mathbf{q}_c + f\Big( \sum_{c'' \in C_{s,t} \setminus \{c\}} |\mathbf{q}_{c''} - \mathbf{q}_c| \Big), \tag{8}$$


  where $\mathbf{p}_s$ denotes the latent factors of student $s$. Similar
  to the CKCC model, we optimize the MFCC model in two
  steps by alternately updating the latent factors and the model
  parameters in the mapping function $f(\cdot)$.

• MFCC-l: The MFCC-l model is a special case of the MFCC
  model where $f(\cdot)$ is simply an inner product (parameterized
  by a vector) instead of an FNN.

• CKCC-l: The CKCC-l model is described in Section 4.2.


5.5 Parameter Learning 

The set of parameters in the optimization problem (Eq. 6) includes
the number of latent dimensions (i.e., $k$), regularization parameters
(i.e., $\alpha_1$, $\alpha_2$, and $\alpha_3$) and the decay rate (i.e.,
$\lambda$). We performed a grid search over all the parameters with
$k \in \{5, 10, \ldots, 25\}$ and
$\alpha_1, \alpha_2, \alpha_3, \lambda \in \{10^{-3}, 10^{-2}, 0.1\}$.
Note that for the CKCC and MFCC models, the optimal neural network
structure (e.g., number of layers, the size of each layer) depends on
the value of $k$. Thus, we swept different neural network structure
parameters for every $k$ value in our grid search. The neural network
structures that consistently achieve good performance contain one
hidden layer with 2 or 3 hidden units.
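The grid search over these parameters can be sketched with `itertools.product`. This is an illustration under stated assumptions: the elided values of $k$ are assumed to step by 5, and `evaluate` is a hypothetical stand-in for training a model with a given configuration and returning its validation error.

```python
from itertools import product
from typing import Callable, Dict, Tuple

K_VALUES = [5, 10, 15, 20, 25]      # assumes the elided grid steps by 5
WEIGHT_VALUES = [1e-3, 1e-2, 0.1]   # shared by alpha_1..alpha_3 and lambda

Config = Dict[str, float]


def grid_search(evaluate: Callable[[Config], float]) -> Tuple[float, Config]:
    """Return the lowest score and the configuration that achieved it."""
    best_score, best_config = float("inf"), {}
    for k, a1, a2, a3, lam in product(K_VALUES, *[WEIGHT_VALUES] * 4):
        config = {"k": k, "alpha1": a1, "alpha2": a2, "alpha3": a3, "decay": lam}
        score = evaluate(config)
        if score < best_score:
            best_score, best_config = score, config
    return best_score, best_config


def toy_validation_error(config: Config) -> float:
    """Hypothetical stand-in for a real train-and-validate routine."""
    return abs(config["k"] - 15) + config["alpha1"]


best_score, best_config = grid_search(toy_validation_error)
```

With these grids the search visits 5 × 3⁴ = 405 configurations; sweeping neural network structures for each value of $k$ would add a further loop around `evaluate`.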


6. RESULTS AND DISCUSSION 


6.1 Overall Performance 
Tables 3 and 4 show the overall performance of all methods for
the FTF and TR student groups, respectively.


Table 3 shows that for FTF students, CKCC and CKCC-l outperform
the baseline methods on most datasets. Specifically, CKCC
outperforms the other compared methods across different experimental
protocols by 4.39%, 7.01%, 3.50%, and 3.87% in terms of
MAE, PTA_0, PTA_1, and PTA_2, respectively. Furthermore, CK-based
methods outperform MF-based methods on all experimental
protocols. This table also shows that co-taken course based
methods (MFCC, MFCC-l and CKCC, CKCC-l) outperform their
respective baseline methods (MF and CK) on all experimental protocols.
This illustrates that for FTF students, both cumulative
knowledge and co-taken courses have great influence on students’
performance, and the proposed methods can capture such influence
accurately.


Table 4 shows that CK has competitive results on TR students.
Moreover, for MF-based methods, MFCC and MFCC-l outperform
MF for all the experimental protocols. This illustrates that co-taken
courses are likely to have an influence on students’ performance, but
for TR students the influence may not be as strong as that of
cumulative knowledge.


6.2 Analysis on Individual Majors 

In order to understand the proposed methods’ performance on each
major, we have tested all the aforementioned methods on different
majors separately. We conducted this group of experiments for both
FTF and TR students, using Spring 2018 as the test set. We
provide detailed experimental results in Tables 5 and 6.


Table 5 shows that the CKCC model outperforms other compared
methods for some majors (e.g., PHYS, CS) on all metrics, but has
weak performance on some metrics for other majors (e.g., MATH,
CHEM). In particular, for the MATH major, CKCC has the highest MAE
while MFCC and MFCC-l have the best (lowest) MAE. The
reason might be that the performance of CKCC relies on the student
historical information, and it tends to have good performance on
students with rich historical information. However, in the test set,
some students in certain majors do not have much historical
information and thus drag down the model performance. Table 6 shows
that, for TR students, there is no method that consistently
outperforms others across different metrics. The reason might be that
the diversity in student characteristics (many TR students have
different backgrounds) leads to diverse course selection plans among
them. Such diversity greatly influences the performance of the
different models.


6.3 Linear versus Nonlinear Mapping Function

As mentioned above, we have two forms of co-taken course interaction
function: an FNN model and a linear model (parameterized by
a vector). Specifically, we compare the results for MFCC versus
MFCC-l, and CKCC versus CKCC-l, respectively, in order to
understand how different mapping functions $f(\cdot)$ influence grade
prediction performance. Table 3 shows that for FTF students, MFCC-l
has slightly better performance than MFCC, and CKCC-l has
performance comparable to that of CKCC across different experimental
protocols. The same trend is shown in Table 4 for TR students.
Furthermore, Table 5 shows that MFCC and CKCC consistently
outperform MFCC-l and CKCC-l across different majors for FTF
students. This illustrates that the influence of co-taken courses for
the FTF student group can be better captured by a nonlinear model
(i.e., FNN) than by a simple linear model. Table 6 shows that for TR
students, MFCC and CKCC do not always outperform MFCC-l and
CKCC-l for different majors. The reason might be that some TR
students have fewer co-taken courses than FTF students,
and the influence from co-taken courses can be well captured
by a linear model.


6.4 Performance on Different Numbers of Co-taken Courses
In this section, we test the CKCC model on different data subgroups
with different numbers of co-taken courses in a term. Specifically,
we take the students in the test set and divide them into five
groups: students who take {2, 3, 4, 5, 6+} courses (6+ refers to six
or more). We perform this experiment on each major for both FTF
and TR students. Due to the page limit, we only
show the results for FTF students. Figure 4 shows the experimental
results in terms of PTA_0, PTA_1 and PTA_2. The results show that
different majors exhibit different trends when the number of
co-taken courses varies. For example, for CHEM and BIOL majors,
the performance of the CKCC model on PTA improves with more
Table 3: Performance Comparison for All Methods on FTF students

          Spring 2018                   Fall 2017                     Spring 2017
Method    MAE    PTA_0  PTA_1  PTA_2    MAE    PTA_0  PTA_1  PTA_2    MAE    PTA_0  PTA_1  PTA_2
MF        0.762  0.172  0.303  0.549    0.759  0.168  0.303  0.556    0.772  0.162  0.306  0.540
MFCC-l    0.756  0.180  0.320  0.565    0.745  0.186  0.331  0.574    0.757  0.181  0.331  0.564
MFCC      0.763  0.175  0.317  0.573    0.753  0.188  0.322  0.573    0.760  0.173  0.317  0.565
CK        0.726  0.190  0.330  0.575    0.724  0.184  0.336  0.575    0.727  0.186  0.333  0.575
CKCC-l    0.711  0.189  0.338  0.589    0.712  0.191  0.343  0.589    0.717  0.182  0.332  0.587
CKCC      0.716  0.187  0.332  0.593    0.709  0.195  0.334  0.588    0.710  0.196  0.339  0.594

Table 4: Performance Comparison for All Methods on TR students

          Spring 2018                   Fall 2017                     Spring 2017
Method    MAE    PTA_0  PTA_1  PTA_2    MAE    PTA_0  PTA_1  PTA_2    MAE    PTA_0  PTA_1  PTA_2
MF        0.775  0.184  0.316  0.537    0.760  0.157  0.300  0.565    0.773  0.168  0.299  0.550
MFCC-l    0.763  0.178  0.315  0.543    0.748  0.187  0.326  0.571    0.755  0.185  0.328  0.563
MFCC      0.761  0.174  0.321  0.544    0.754  0.177  0.330  0.580    0.761  0.177  0.316  0.569
CK        0.753  0.268  0.400  0.586    0.770  0.259  0.389  0.570    0.750  0.273  0.397  0.583
CKCC-l    0.733  0.182  0.324  0.560    0.743  0.180  0.313  0.558    0.739  0.172  0.310  0.563
CKCC      0.735  0.181  0.323  0.562    0.728  0.175  0.335  0.571    0.740  0.169  0.318  0.553

Table 5: Performance Comparison for All Methods on FTF students on Different Majors

Method   MATH                          PHYS                          CHEM
         MAE    PTA0   PTA1   PTA2     MAE    PTA0   PTA1   PTA2     MAE    PTA0   PTA1   PTA2
MF       0.762  0.234  0.336  0.523    1.099  0.106  0.206  0.383    0.684  0.262  0.399  0.601
MFCC-l   0.758  0.195  0.333  0.568    0.960  0.113  0.213  0.447    0.678  0.221  0.374  0.589
MFCC     0.758  0.206  0.322  0.559    0.998  0.163  0.248  0.433    0.663  0.249  0.380  0.592
CK       0.782  0.267  0.378  0.569    0.910  0.135  0.270  0.468    0.680  0.249  0.393  0.595
CKCC-l   0.784  0.184  0.316  0.535    0.978  0.238  0.294  0.437    0.734  0.312  0.449  0.611
CKCC     0.842  0.309  0.413  0.562    0.842  0.254  0.373  0.508    0.697  0.290  0.411  0.620

Method   [...]                         [...]                         [...]
         MAE    PTA0   PTA1   PTA2     MAE    PTA0   PTA1   PTA2     MAE    PTA0   PTA1   PTA2
MF       0.655  0.201  0.36   0.623    0.723  0.190  0.346  0.595    0.687  0.253  0.411  0.626
MFCC-l   0.664  0.181  0.365  0.630    0.715  0.177  0.326  0.603    0.777  0.317  0.439  0.599
MFCC     0.627  0.231  0.381  0.659    0.704  0.209  0.362  0.605    0.676  0.274  0.429  0.638
CK       0.606  0.299  0.466  0.681    0.722  0.244  0.395  0.597    0.643  0.316  0.464  0.653
CKCC-l   0.693  0.288  0.460  0.632    0.784  0.242  0.376  0.578    0.771  0.341  0.461  0.605
CKCC     0.600  0.310  0.465  0.692    0.696  0.256  0.395  0.612    0.660  0.329  0.467  0.649

Table 6: Performance Comparison for All Methods on TR students on Different Majors

Method   MATH                          PHYS                          CHEM
         MAE    PTA0   PTA1   PTA2     MAE    PTA0   PTA1   PTA2     MAE    PTA0   PTA1   PTA2
MF       0.608  0.270  0.433  0.617    0.675  0.235  0.431  0.569    0.749  0.219  0.325  0.553
MFCC-l   0.637  0.270  0.418  0.610    0.669  0.216  0.353  0.588    0.634  0.281  0.412  0.649
MFCC     0.621  0.241  0.397  0.645    0.577  0.353  0.471  0.667    0.675  0.228  0.404  0.649
CK       0.573  0.394  0.545  0.677    0.741  0.200  0.275  0.550    0.679  0.368  0.491  0.623
CKCC-l   0.641  0.384  0.515  0.677    0.694  0.325  0.450  0.625    0.651  0.377  0.500  0.667
CKCC     0.613  0.404  0.576  0.707    0.805  0.200  0.350  0.600    0.642  0.404  0.518  0.675

Method   [...]                         [...]                         [...]
         MAE    PTA0   PTA1   PTA2     MAE    PTA0   PTA1   PTA2     MAE    PTA0   PTA1   PTA2
MF       0.614  0.217  0.405  0.662    0.836  0.175  0.302  0.538    0.711  0.200  0.341  0.559
MFCC-l   0.610  0.227  0.419  0.665    0.818  0.189  0.325  0.541    0.670  0.213  0.366  0.617
MFCC     0.608  0.243  0.415  0.658    0.796  0.193  0.333  0.578    0.674  0.206  0.367  0.604
CK       0.608  0.223  0.406  0.659    0.737  0.212  0.369  0.577    0.695  0.226  0.370  0.600
CKCC-l   0.598  0.235  0.426  0.659    0.756  0.184  0.343  0.599    0.679  0.228  0.384  0.600
CKCC     0.602  0.231  0.412  0.672    0.773  0.234  0.371  0.563    0.643  0.260  0.393  0.629

Figure 4: PTA Results for Different Number of Co-taken Courses on FTF students

Figure 5: PTA Results for Different Number of Co-taken Course Subjects on FTF students


co-taken courses. This observation suggests that CKCC is able to
leverage the stronger influence of co-taken courses to improve its
performance. However, for PHYS and CS majors, CKCC achieves better
performance with 2, 3 or 6+ co-taken courses than with 4 or 5
co-taken courses. We postulate that this is due to the
characteristics of the courses chosen within a term and their
content. These results also indicate that CKCC is able to model the
influence of co-taken courses regardless of their number.
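
As a rough illustration of this subgroup analysis, the following Python sketch buckets test-set predictions by the number of courses taken in a term and computes PTA per bucket. It assumes that PTA_k is the fraction of predictions whose error is at most k ticks, where a tick is one step on the letter-grade scale. The record layout, the function names and the toy numbers are hypothetical and are not taken from the experiments.

```python
from collections import defaultdict

def bucket(num_courses):
    """Map a term's course count to the group label used in this experiment."""
    return "6+" if num_courses >= 6 else str(num_courses)

def pta_by_group(records, max_ticks=(0, 1, 2)):
    """records: iterable of (num_courses, tick_error) pairs, where tick_error
    is the absolute difference in grade ticks between prediction and truth
    (assumed metric definition). Returns {group: {k: fraction within k ticks}}."""
    errors = defaultdict(list)
    for num_courses, tick_error in records:
        if num_courses < 2:
            continue  # a co-taken course requires at least two courses
        errors[bucket(num_courses)].append(tick_error)
    return {
        group: {k: sum(e <= k for e in errs) / len(errs) for k in max_ticks}
        for group, errs in errors.items()
    }

# Toy, made-up records: (courses in the term, tick error of one prediction).
toy = [(2, 0), (2, 3), (4, 1), (4, 2), (6, 0), (7, 1)]
result = pta_by_group(toy)
```

Reporting the metric per bucket rather than over the whole test set is what makes trends such as those for CHEM and BIOL visible; small buckets will of course give noisier estimates.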


6.5 Performance on Different Numbers of Co-taken Course Subjects

In this section, we extract each course’s subject and test the CKCC
model on data subgroups with different numbers of co-taken course
subjects in a term. We conduct this experiment because we assume that
courses with the same subject tend to have related knowledge
components. Students who co-take courses from many different subjects
may therefore face wide knowledge diversity. This experiment tests
the performance of CKCC with respect to the number of co-taken
course subjects.


Specifically, we take the students in the test set and divide them
into five groups: students who take courses from {1,2,3,4,5} subjects
in a term. Since few students co-take courses from 6+ subjects, we
exclude these students from our experiment. We perform this group of
experiments on each major for both FTF and TR students. Due to the
page limit, we only show the results for FTF students. Figure 5 shows
the experimental results in terms of PTA0, PTA1 and PTA2. The results
show that the effect of the number of co-taken course subjects on
CKCC’s predictions differs across majors. For example, for CHEM, CS
and BIOL majors, the CKCC model achieves its best PTA with 1 co-taken
course subject. This observation suggests that CKCC is able to model
co-taken courses’ influence better with less knowledge diversity in a
term. However, for the IT major, CKCC achieves better performance
with more co-taken course subjects. For MATH and PHYS majors, CKCC
performs better on 2 or 5 co-taken course subjects than on other
subgroups. We assume that this is affected by the characteristics of
different majors. Moreover, for MATH and IT majors, the PTA results
do not vary much compared to CHEM and BIOL majors. This illustrates
that for some majors, students may take courses from several subjects
in a term, and the CKCC model can still capture the co-taken courses’
influence well.
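
The grouping used in this experiment can be sketched as follows. The sketch assumes that a course’s subject is the alphabetic prefix of its course code (for example, BIOL in BIOL311); this convention, the function names and the code MATH214 in the usage note are illustrative assumptions rather than a description of our preprocessing.

```python
import re

def course_subject(course_code):
    """Return the leading alphabetic subject prefix, e.g. 'BIOL311' -> 'BIOL'."""
    match = re.match(r"[A-Za-z]+", course_code.strip())
    if match is None:
        raise ValueError(f"no subject prefix in {course_code!r}")
    return match.group(0).upper()

def num_subjects(term_courses):
    """Count the distinct subjects among the courses taken in one term."""
    return len({course_subject(c) for c in term_courses})

def subject_group(term_courses, max_subjects=5):
    """Group label in {1,...,5}; None when the student falls outside the
    studied range (for example, the excluded 6+ subject case)."""
    n = num_subjects(term_courses)
    return n if 1 <= n <= max_subjects else None
```

For instance, `subject_group(["BIOL311", "CHEM313", "MATH213", "PHYS260"])` places a student in the 4-subject group, while a term made up of MATH213 and a hypothetical MATH214 falls in the 1-subject group.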


7. SIGNIFICANCE AND IMPACT 


To highlight a use-case scenario of the developed next-term grade
prediction approach using co-taken courses, we ran a simulated case
study. Having demonstrated the prediction accuracy of the proposed
models, the objective of this case study is to highlight their
strengths in helping students select courses for a future term.
Implicitly, we want to give students information about how their
workload (or their overall grades) changes with the addition of one
or more courses in the next term.


Specifically, we extract two pairs of popular co-taken courses:
BIOL311 (“General Genetics”) and CHEM313 (“Organic Chemistry”), and
MATH213 (“Analytic Geometry and Calculus II”) and PHYS260
(“University Physics”), and conduct a study to illustrate how our
model can help students plan their course selections or allocate the
necessary study time. Take the course pair BIOL311 and CHEM313 as an
example. We extract the students who take BIOL311 and CHEM313
together in a term. We predict these students’ performance on BIOL311
using the CKCC model. We then eliminate CHEM313 from our data set and
predict the grade on BIOL311 again using the CKCC model. Comparing the
Figure 6: Comparison Results on the Co-taken Course Influence


predicted grades helps determine whether the two courses should be
taken together within the same term. The sampled students are each
enrolled in a total of five courses in the particular term. The
comparison results are shown in Figure 6 (a). It is a scatter plot of
students’ predicted grades, where the x-axis shows the predicted
performance on BIOL311 when co-taken with CHEM313 and the y-axis
shows the predicted performance on BIOL311 with CHEM313 removed. We
have conducted the same experiments for the other course pairs using
the same protocol and show these results in Figure 6 (b), (c) and (d).
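
The protocol above amounts to querying a grade predictor twice, once with and once without the co-taken course. The sketch below shows the idea in Python. The `predict` callable stands in for the CKCC model and is an assumption, as are `toy_predict`, the student identifiers and the numbers, which are made up purely for illustration.

```python
def cotaken_effect(predict, student, target, term_courses, removed):
    """Predict the target-course grade with and without one co-taken course.

    predict(student, target, co_taken) -> float is any grade predictor that
    accepts the list of co-taken courses; it stands in for the CKCC model.
    Returns (grade_with, grade_without), i.e. one (x, y) scatter point.
    """
    co_taken = [c for c in term_courses if c != target]
    if removed not in co_taken:
        raise ValueError(f"{removed} is not co-taken with {target}")
    reduced = [c for c in co_taken if c != removed]
    return predict(student, target, co_taken), predict(student, target, reduced)

def toy_predict(student, target, co_taken):
    """Made-up stand-in predictor: a base grade minus a per-course penalty."""
    base = {"s1": 3.3, "s2": 2.7}[student]
    return round(base - 0.1 * len(co_taken), 2)

term = ["BIOL311", "CHEM313", "MATH213"]
with_c, without_c = cotaken_effect(toy_predict, "s1", "BIOL311", term, "CHEM313")
```

A point above the diagonal (`without_c > with_c`) corresponds to a student whose predicted grade improves when the co-taken course is dropped; a point on the diagonal corresponds to no predicted change.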


In general, students’ predicted performance improves when the other
course is eliminated, due to the reduction in workload. However,
different students are affected differently by the additional course.
Among students who take BIOL311 and CHEM313, some will see an
improvement in their BIOL311 grades if they do not enroll in CHEM313
in the same semester. On the other hand, some students will not see
any change in their BIOL311 grades based on CHEM313 (the plotted
results along the diagonal). Similar trends can be observed in Figure
6 (b), (c) and (d) as well. In Figure 6, we also highlight different
cases where a student’s grade changes with the removal of the
particular course. Using this information, students can plan the set
of courses that they might enroll in for the next term, and allocate
study time accordingly.


8. CONCLUSION AND FUTURE WORK 


In this work, we propose grade prediction models that incorporate
both cumulative knowledge and co-taken courses (CKCC) to predict
students’ performance in the next term. The proposed models consider
both the cumulative knowledge a student has acquired after taking a
series of courses in past terms, and the co-taken courses the student
plans to take in the next term. Our experimental results on data from
George Mason University show that the proposed models significantly
outperform other competitive baselines on most of the datasets for
the task of next-term grade prediction. Moreover, our experimental
results show that the proposed model is able to capture the strong
influence of co-taken courses to improve its grade prediction
performance. Furthermore, we ran a simulated case study to illustrate
how our proposed model can help students with course selection for a
future term.

In the future, we plan to take into account additional factors, such
as the instructor, the student’s academic level and the course’s
difficulty level, along with co-taken course information, in order to
achieve more accurate grade prediction results. We hope such a grade
prediction system can not only help students select courses and
finish their college studies, but also guide them in future career
planning.